The site https://www.mathgenealogy.org/ contains over 276,000 records of mathematics PhD graduates and their supervisors. This is effectively a genealogy of mathematical supervision (which should have a sizable effect on thinking, topics, and reading). The R package ggenealogy contains an example dataset from this source and facilitates the consumption and plotting of this type of data.
Given that my thesis was just certified, I want to see if I can trace up the mathematical genealogy tree to visualize my thought-leading predecessors.
library(ggenealogy)
library(ggplot2)
library(magrittr)
data("statGeneal", package = "ggenealogy")
df <- statGeneal %>%
#dplyr::filter(parent != "") %>%
tibble::as_tibble()
print(df, n=3)
## # A tibble: 8,165 x 6
## child parent gradYear country
## <chr> <chr> <dbl> <chr>
## 1 Nicolas Chopin "Christian Robert" 2003 France
## 2 Melvin Springer "Everett Welker" 1947 UnitedStates
## 3 Shelemyahu Zacks "" 1962 UnitedStates
## school
## <chr>
## 1 Université Pierre-et-Marie-Curie - Paris VI
## 2 University of Illinois at Urbana-Champaign
## 3 Columbia University
## thesis
## <chr>
## 1 Applications of Sequential Monte Carlo methods to Bayesian Statistics
## 2 Joint Sampling Distribution of Mean and Standard Deviation for a Chi-square U~
## 3 Optimal Strategies in Randomized Factorial Experiments
## # ... with 8,162 more rows
hist(df$gradYear)
OK, about 8k observations, covering “all the parent-child relationships where both parent and child received an advanced degree of statistics as of June 6, 2015.” This may or may not contain the people I am looking for.
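Before searching for specific people, it can help to check how complete the parent column is, since the third printed row has an empty parent. The following is a minimal base R sketch, assuming the same column names as `statGeneal`; the `toy` data frame is just the three rows printed above, retyped by hand so the snippet runs without the package.

```r
# Toy stand-in for the data: the three rows printed above, typed by hand.
toy <- data.frame(
  child    = c("Nicolas Chopin", "Melvin Springer", "Shelemyahu Zacks"),
  parent   = c("Christian Robert", "Everett Welker", ""),
  gradYear = c(2003, 1947, 1962),
  stringsAsFactors = FALSE
)

# Rows whose supervisor is unknown are stored with an empty string.
n_missing     <- sum(toy$parent == "")
share_missing <- mean(toy$parent == "")

# Span of graduation years covered by these rows.
year_range <- range(toy$gradYear)

n_missing
share_missing
year_range
```

Swapping `toy` for `df` should give the same summary for the full dataset.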
Note that grad year:
Through trial and error I know that Di Cook is not in the data. The original paper does have Thomas Lumley, another professor of interest. But perhaps first I will manually look up Cook’s genealogy.
Di, Di’s supervisor, and “grand-supervisor” are not in the list, so I may have to go to plan B: looking at Thomas Lumley. After looking at both parents and children, I know that Thomas has 1 child in the data: Petra Buzkova. From the paper, we can see that the oldest predecessor is David Cox.
lumley_p <- grepl("Lumley", df$parent, fixed = TRUE)
sum(lumley_p)
## [1] 1
df[lumley_p, ]
## # A tibble: 1 x 6
## child parent gradYear country school
## <chr> <chr> <dbl> <chr> <chr>
## 1 Petra Buzkova Thomas Lumley 2004 UnitedStates University of Washington
## thesis
## <chr>
## 1 Marginal Regression Analysis of Longitudinal Data with Irregular, Biased Samp~
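Since the search has to be run against both the child and the parent column, a small helper saves some typing. This is only a sketch: `find_person()` is a hypothetical name, not part of ggenealogy, and the `toy` rows are copied from the printed output above.

```r
# Hypothetical helper: return every row where the pattern matches
# either the student (child) or the supervisor (parent).
find_person <- function(df, pattern) {
  hit_child  <- grepl(pattern, df$child,  fixed = TRUE)
  hit_parent <- grepl(pattern, df$parent, fixed = TRUE)
  df[hit_child | hit_parent, c("child", "parent", "gradYear")]
}

# Two rows taken from the output printed earlier.
toy <- data.frame(
  child    = c("Petra Buzkova", "Nicolas Chopin"),
  parent   = c("Thomas Lumley", "Christian Robert"),
  gradYear = c(2004, 2003),
  stringsAsFactors = FALSE
)

hits <- find_person(toy, "Lumley")
hits
```

A surname that matches nowhere returns a zero-row data frame, which is how an absence like Di Cook's would show up.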
## Prep the network info, more on this in `As network layout (iGraph)`.
ig <- dfToIG(df)
Let’s grab the paths while we are on the topic of names. Actually, if we go all the way to Buzkova, this is the example case in the paper.
pathCB <- getPath("David Cox", "Petra Buzkova", ig, df,
                  "gradYear", isDirected = FALSE)
plotPath(pathCB, df, "gradYear", fontFace = 4) +
  xlab("Graduation Year") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 10)) +
  scale_x_continuous(expand = c(0.1, 0.2))
Good, we have a start. We will want to find a way to traverse the hierarchy to find all of the ancestors without filling in the cousin nodes (or, preferably, filling them in only faintly). As an example poster, see https://www.mathgenealogy.org/posters/raich.pdf.
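One way to collect only the direct ancestors is to walk up the parent column level by level, ignoring siblings and cousins. The sketch below is base R and not a ggenealogy function; `get_ancestors()` and the names in `toy` are made up for illustration, and the empty string is assumed to mean "supervisor unknown".

```r
# Hypothetical helper: collect every ancestor of `person` by repeatedly
# looking up the parents of the current generation.
get_ancestors <- function(df, person) {
  found    <- character(0)
  frontier <- person
  while (length(frontier) > 0) {
    parents <- df$parent[df$child %in% frontier]
    # Drop unknown supervisors and anyone already visited (guards against loops).
    parents <- unique(parents[!is.na(parents) & parents != ""])
    parents <- setdiff(parents, c(found, person))
    found    <- c(found, parents)
    frontier <- parents
  }
  found
}

# Made-up family: "Cousin" shares a grand-advisor but is not an ancestor.
toy <- data.frame(
  child  = c("Student", "Advisor", "Cousin", "Grand-advisor"),
  parent = c("Advisor", "Grand-advisor", "Grand-advisor", ""),
  stringsAsFactors = FALSE
)

ancestors <- get_ancestors(toy, "Student")
ancestors
```

Because the lookup uses `%in%`, a student listed with several supervisors would have all of them followed, and the cousins are simply the rows whose parent is in `ancestors` but whose child is not.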
We can look at trees from a top-down or bottom-up view. Top-down works well, though bottom-up not so much, at least with this data and these functions. Of particular note, the latter case contains only 1:1 student:adviser relationships. Studying the example poster we see that
l <- plotAncDes("David Cox", df, mAnc = 1, mDes = 6, vCol = "blue") +
labs(subtitle = "Interesting, but too many \n cousins of Thomas Lumley")
r <- plotAncDes("Thomas Lumley", df, mAnc = 6, mDes = 1, vCol = "blue") +
labs(subtitle = "Not very interesting, \n nb only 1:1 relationships")
library(patchwork)
l + r
I looked at a few of the more recent children in the plotPathOnAll output and by chance saw Hilary Parker, who co-hosts the podcast Not So Standard Deviations, https://nssdeviations.com/, of which I am a huge fan. Let’s see if she has a better tree:
parker_p <- grepl("Parker", df$child, fixed = TRUE)
sum(parker_p)
## [1] 8
parkers <- df[parker_p, ] %>% dplyr::pull(child)
plotAncDes("Hilary Parker", df, mAnc = 1, mDes = 6, vCol = "blue") +
labs(subtitle = "Hilary Parker")
Well, it turns out none of the 8 Parker students have good trees. In my opinion, the filter requiring rows to be labeled as statistics-focused is too restrictive. Another shortcoming is that I haven’t seen an example of a student having multiple advisers.
We can also highlight a path against the backdrop of the rest of the data, with the remaining nodes placed at iterating y-axis heights. It looks neat, but the vertical placement seems a bit arbitrary.
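Rather than relying on what I happened to see in plots, the multiple-adviser question can be checked directly by counting distinct known parents per child. A rough base R sketch follows; `multi_adviser()` is a hypothetical helper and the `toy` names are invented. One caveat: two different people who share a name would also be flagged.

```r
# Hypothetical helper: names that appear as a child with more than one
# distinct, non-empty parent.
multi_adviser <- function(df) {
  known  <- unique(df[df$parent != "", c("child", "parent")])
  counts <- table(known$child)
  as.character(names(counts)[counts > 1])
}

# Invented rows: S1 has two advisers, S2 has one, S3 has none recorded.
toy <- data.frame(
  child  = c("S1", "S1", "S2", "S3"),
  parent = c("A",  "B",  "A",  ""),
  stringsAsFactors = FALSE
)

co_supervised <- multi_adviser(toy)
co_supervised
```

Running `multi_adviser(df)` on the package data would show whether any co-supervised students exist there.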
plotPathOnAll(pathCB, df, ig, "gradYear",
              bin = 200, nodeSize = .5, pathNodeSize = 2.5,
              nodeCol = "grey60", edgeCol = "grey80",
              animate = TRUE) ## plotly static interaction not animated.
For networks and graphs, the igraph package is a long-standing go-to. We can also get to such an object with dfToIG(). This opens the door to all sorts of layouts and other network-related functions.
ig <- dfToIG(df)
class(ig)
## [1] "igraph"
ig
## IGRAPH 62f7545 UNW- 7123 8165 --
## + attr: name (v/c), weight (e/n)
## + edges from 62f7545 (vertex names):
## [1] Nicolas Chopin --Christian Robert Melvin Springer --Everett Welker
## [3] Shelemyahu Zacks -- James Sweeder --
## [5] Nino Kordzakhia -- Pavel Vanecek --Zuzana Prášková
## [7] Shyamal De -- Thomas Willke --
## [9] Vasant Huzurbazar-- Rita Engelhardt --William Cumberland
## [11] Fred Andrews -- Arthur Albert --
## [13] John Folks -- Arnold Goodman --
## [15] William Pruitt -- Thomas Birkner --
## + ... omitted several edges
getBasicStatistics(ig)
## $isConnected
## [1] TRUE
##
## $numComponents
## [1] 1
##
## $avePathLength
## [1] 2.801
##
## $graphDiameter
## [1] 10
##
## $numNodes
## [1] 7123
##
## $numEdges
## [1] 8165
##
## $logN
## [1] 8.871
plot(ig)
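The statistics above report a single connected component, which is surprising for 7,123 people. The printed edge list shows edges whose second endpoint is blank, so my guess (an assumption, not confirmed in the package source) is that the empty-string parent becomes one shared vertex that ties otherwise unrelated lineages together. The toy sketch below uses igraph directly rather than dfToIG(), with made-up names, to show that effect.

```r
library(igraph)

# Made-up lineages: A -> P1 and C -> P2, where P1 and P2 have unknown parents.
toy <- data.frame(
  child  = c("A",  "P1", "C",  "P2"),
  parent = c("P1", "",   "P2", ""),
  stringsAsFactors = FALSE
)

# Keeping the "" rows: the empty name acts as a hub joining both lineages.
g_all <- graph_from_data_frame(toy[, c("child", "parent")], directed = FALSE)

# Dropping the "" rows: the two lineages fall apart again.
known   <- toy[toy$parent != "", c("child", "parent")]
g_known <- graph_from_data_frame(known, directed = FALSE)

is_connected(g_all)
components(g_known)$no
```

If the same holds for the real data, applying the commented-out `parent != ""` filter before dfToIG() should give more meaningful component counts and path lengths.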
There is definitely potential to reproduce such genealogy posters. Unfortunately, the data included in the package does not seem sufficient for our purposes.
## Packages used
pkgs <- c("ggenealogy", "ggplot2")
## Package & session info
devtools::session_info(pkgs)
## - Session info ---------------------------------------------------------------
## setting value
## version R version 4.1.2 (2021-11-01)
## os Windows 10 x64 (build 19044)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## ctype English_United States.1252
## tz Australia/Sydney
## date 2022-06-10
## pandoc 2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
##
## - Packages -------------------------------------------------------------------
## package * version date (UTC) lib source
## askpass 1.1 2019-01-13 [1] CRAN (R 4.1.2)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.1.1)
## cli 3.3.0 2022-04-25 [1] CRAN (R 4.1.3)
## colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.1.2)
## cpp11 0.4.2 2021-11-30 [1] CRAN (R 4.1.2)
## crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.3)
## crosstalk 1.2.0 2021-11-04 [1] CRAN (R 4.1.2)
## curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.2)
## data.table 1.14.2 2021-09-27 [1] CRAN (R 4.1.2)
## digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
## dplyr 1.0.9 2022-04-28 [1] CRAN (R 4.1.3)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.5)
## fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.3)
## farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.2)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.2)
## generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2)
## ggenealogy * 1.0.1 2020-03-04 [1] CRAN (R 4.1.3)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.1.3)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.1)
## htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
## htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.1.2)
## httr 1.4.3 2022-05-04 [1] CRAN (R 4.1.3)
## igraph 1.3.1 2022-04-20 [1] CRAN (R 4.1.3)
## isoband 0.2.5 2021-07-13 [1] CRAN (R 4.1.2)
## jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.1.3)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.1)
## later 1.3.0 2021-08-18 [1] CRAN (R 4.1.2)
## lattice 0.20-45 2021-09-22 [1] CRAN (R 4.1.3)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.2)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.2)
## magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
## MASS 7.3-57 2022-04-22 [1] CRAN (R 4.1.3)
## Matrix 1.4-1 2022-03-23 [1] CRAN (R 4.1.3)
## mgcv 1.8-40 2022-03-29 [1] CRAN (R 4.1.3)
## mime 0.12 2021-09-28 [1] CRAN (R 4.1.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.1)
## nlme 3.1-157 2022-03-25 [1] CRAN (R 4.1.3)
## openssl 2.0.2 2022-05-24 [1] CRAN (R 4.1.3)
## pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.2)
## plotly 4.10.0 2021-10-09 [1] CRAN (R 4.1.2)
## plyr 1.8.7 2022-03-24 [1] CRAN (R 4.1.3)
## promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
## purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.3)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.1.3)
## Rcpp 1.0.8.3 2022-03-17 [1] CRAN (R 4.1.3)
## reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.2)
## rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.3)
## scales 1.2.0 2022-04-13 [1] CRAN (R 4.1.3)
## stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.2)
## sys 3.4 2020-07-23 [1] CRAN (R 4.1.2)
## tibble 3.1.7 2022-05-03 [1] CRAN (R 4.1.3)
## tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.1.2)
## tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.2)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.2)
## vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.3)
## viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.1.2)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.2)
## yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2)
##
## [1] C:/Users/spyri/Documents/R/win-library/4.1
## [2] C:/Program Files/R/R-4.1.2/library
##
## ------------------------------------------------------------------------------